ERROR-FREE MILESTONES IN ERROR PRONE 
MEASUREMENTS 
University of Pennsylvania 

A predictor variable or dose that is measured with substantial 
error may possess an error-free milestone, such that it is known with 
negligible error whether the value of the variable is to the left or 
right of the milestone. Such a milestone provides a basis for estimat- 
ing a linear relationship between the true but unknown value of the 
error-free predictor and an outcome, because the milestone creates a 
strong and valid instrumental variable. The inferences are nonpara- 
metric and robust, and in the simplest cases, they are exact and 
distribution free. We also consider multiple milestones for a single 
predictor and milestones for several predictors whose partial slopes 
are estimated simultaneously. Examples are drawn from the Wiscon- 
sin Longitudinal Study, in which a BA degree acts as a milestone 
for sixteen years of education, and the binary indicator of military 
service acts as a milestone for years of service. 

1. Introduction: strong, valid instrumental variables from error-free mile- 
stones. 

1.1. Error-free milestones. A fallible measure contains an error-free mile- 
stone if there is some value of the measure, called the milestone, such that, 
despite errors of measurement, the measurement is always on the correct 
side of the milestone. Error-free milestones arise in a variety of ways. It may 
happen that a nonnegative quantity may contain errors when it is strictly 
positive, but a zero is truly and exactly a zero; an example involving du- 
ration of exposure to anesthetics is discussed in Section 1.2. If a scale is 
defined in terms of many item responses, then for some possible definitions 
of the scale, an error free item yields an error free milestone; an example in- 
volving a scale of exposure to combat in Vietnam is discussed in Section 1.3. 
If deceit is distinguished from error, then the concept of an error-free 
milestone (as distinct from a deception-free milestone) is relevant to responses 
to questions; see Section 1.4. 

In practice, an error-free milestone is a model intended to approximate 
situations in which errors that respect the milestone are commonplace and 
errors that cross the milestone are extremely infrequent. In Section 1.2 im- 
precision in recording the duration of anesthesia respects a milestone at 
zero, whereas failing to bill for an operation causes an error that crosses the 
milestone; however, there are strong disincentives for the latter error. 

1.2. Inhaled anesthetics and neurodegenerative disorders. Measurable cog- 
nitive dysfunction may occur in perhaps 20% of patients one week after 
surgery with an inhaled anesthetic [Johnson et al. (2002)], but long term 
effects in humans have not, so far, been demonstrated. Eckenhoff et al. 
(2004) provide in vitro laboratory evidence suggesting that the anesthet- 
ics halothane and isoflurane enhanced cellular changes associated with the 
development of neurodegenerative disorders such as Alzheimer and Parkin- 
son disease. 

Mounting a large scale, long term study in humans faces several significant 
obstacles, including (i) measurement of the duration of anesthetic exposure, 
(ii) measurement of neurodegenerative outcomes, and (iii) confounding of 
the need for surgery with effects of anesthetics given during surgery. Jeffrey 
Silber, Roderic Eckenhoff and one of the authors (Rosenbaum) have pro- 
posed to use data from Medicare as the basis for such a study. Medicare is 
the program of the U.S. government which provides publicly financed health 
care to people of age 65 or greater. With some exceptions, doctors and hos- 
pitals bill Medicare for services provided to the elderly, and these Medicare 
claims create a national record of health care for Medicare recipients. 

If you fall in a certain way and break your hip, it is likely that you will 
need hip surgery requiring prolonged anesthesia; for Medicare recipients, 
these events will be recorded in Medicare claims. If you fall in a slightly 
different way and break your pelvis, it is likely that your condition will be 
treated without surgery, and hence without inhaled anesthesia; these events, 
also, will be recorded in Medicare claims. Comparing patients who broke 
either a hip or a pelvis is an example of using differential treatment effects 
to remove confounding from generic biases [Rosenbaum (2006)]. 

Like lawyers, and unlike surgeons, anesthesiologists bill for their time. 
Anesthesiologists submit a bill to Medicare which records the duration of 
anesthetic care. Silber et al. (2007) compared the times recorded in these 
bills to times obtained by chart abstraction for 1931 patients in Pennsyl- 
vania. The bills were typically in close agreement with the chart abstrac- 
tions, with a median absolute difference of five minutes, but as seen in the 
quantile-quantile plot in Figure 2 of Silber et al. (2007), the distribution is 
approximately symmetric with extremely long tails, with more than 1% of 
bills discrepant by more than an hour. The cause of these large discrepancies 
is not known, and could conceivably be errors in abstraction rather than in 
bills; however, we suspect that our algorithm for record linkage sometimes 
makes a few gross errors, possibly due to errors in dates on bills. Silber et 
al. (2007) conclude that anesthesia bills can be used to gauge anesthesia 
duration, providing robust methods are used to prevent the long tails from 
having inappropriate influence. 

Although the anesthesia bills measure anesthesia duration with moderate 
but long tailed error, it is virtually certain that a patient who did not have 
surgery had no exposure to inhaled anesthetics. In other words, the nonzero 
anesthesia duration for a broken hip will contain error, but the zero duration 
for a broken pelvis will truly be zero, creating an error-free milestone at zero. 

1.3. Scales in which certain levels may be verified using administrative 
records. Lund et al. (1984) created a seven point scale of the degree of 
exposure to combat during the Vietnam War. Certain points on the scale 
can be determined objectively from military records; others depend on self- 
report. For instance, "in military service during 1965-1975" by itself scores a 
0, whereas that combined with "stationed in Vietnam" scores a 1, while both 
of these together with "saw injury or death of U.S. Serviceman" scores a 2, 
and so on, and "wounded in combat" scores a 5. Military records indicate 
when and where an individual has served and whether the individual was 
wounded in combat, but there is no record of whether an individual saw the 
injury or death of a U.S. Serviceman. A misstatement by an individual may 
result in erroneous placement on the scale, but only within the milestones 
created by the scale's dependence, at certain points, on objective records. 

Expressed more abstractly, it is common to combine several oriented 
pieces of information or items to form a scale. Here the scale is the de- 
gree of exposure to combat and the items are such events as "wounded in 
combat." With $m$ binary items, the $2^m$ possible patterns of item responses 
are partially ordered; for instance, a person who is positive for items 1 and 
2 and for no other items is at least as high in the partial order as a person 
who is positive for item 2 and no other item. In rare instances, the 
patterns that actually occur form a linear order or Guttman scale, so only 
$m + 1$ of the $2^m$ possible patterns actually occur. More commonly, the 
definition of the scale imposes a linear order that is compatible with (i.e., 
is a linear extension of) the partial order on the $2^m$ possible patterns. If 
some of the items are error free and others are error prone, then it is always 
possible to define the linear order or scale so that it gives lexicographic 
priority to at least one of the error-free items, thereby creating an error-free 
milestone; see the discussion of the lexicographic sum of partial orders in 
Trotter (1992), page 24. Whether or not such a scale will be reasonable as a 
scale obviously depends upon the content of the specific items involved, but 
the mere existence of scales with error-free milestones is guaranteed by the 
existence of at least one error-free item. 

As noted by Dee, Evans and Murray (1999), in longitudinal data for re- 
search in education and labor economics, it is increasingly common to com- 
bine transcripts from educational institutions with survey questionnaires. 
Although this does not appear to have been done as yet, in parallel with 
Lund et al. (1984), one could create educational scales anchored by mile- 
stones determined from transcripts, for instance, receipt of particular aca- 
demic degrees. 

1.4. Milestones to anchor memory. Measurements that describe people 
are often obtained by asking them questions. How many years of educa- 
tion do you have? How long did you serve in the U.S. military? How many 
cigarettes do you smoke per day? To what extent are you prone to violent 
behavior? In asking such questions, an investigator hopes that the respon- 
dent can remember the answer, and can express the answer in a manner 
consistent with the investigator's operational definitions, but these hopes 
are not always realized. Aspects of questionnaire design are discussed by 
Sudman and Bradburn (1986), Lyberg (1997) and Tourangeau, Rips and 
Rasinski (2000). 

For instance, by "years of education," most investigators mean "grades 
successfully completed," not years spent trying. Imagine a person who dropped 
out of high school in the middle of tenth grade, having repeated grades three 
and seven. Such a person might think of this as twelve years of education 
(ten plus two), whereas the investigator might intend this to be classified 
as successful completion of grades one through nine, or nine years of edu- 
cation. Similarly, a person who achieves a BA degree with three years of 
college, a summer session after the freshman year and some advanced place- 
ment credit might report fifteen years of education, whereas the investigator 
might intend to credit sixteen years of education for achievement of a BA. 

A respondent may intend to report accurately, but may fail to do so be- 
cause of lapses of memory and uncertainties about the intended meaning of 
the question. Certain events, however, are easy to remember and unambigu- 
ous in question and answer: they are events punctuated by public ceremony, 
official sanction, public documents, and by kinds of behavior rather than 
degrees of behavior. An honest, sober, mentally competent respondent is 
unlikely to err in response to the following questions: Do you have a high 
school degree or high school equivalency degree? Did you ever serve in the 
U.S. military? Have you smoked any part of at least one cigarette in the last 
seven days? Have you ever been convicted for assault? If scales of behavior 
are defined in terms of such unambiguous milestones, and if questioning is 
organized to ensure that the milestone is respected in responses, then the 
milestones may be measured with negligible error, despite continued errors 
at points in the scale between milestones. 

1.5. Outline. Our purpose here is to formalize these considerations, showing 
how error-free milestones permit estimation of slopes for the true but 
unknown error-free measurements. Speaking informally, almost by definition 
of the scale itself, an error-free milestone creates a strong and valid 
instrumental variable, so that location with respect to the milestone is related to 
the true measurement but is uncontaminated by measurement error; see 
Section 2 for formal definitions and results. The inferences are nonparametric 
and robust, and in the simplest cases they are exact and distribution free. 
In Section 2 the most common and simplest case is discussed, namely, a 
single milestone for a single variable, first for matched pairs using Wilcoxon's 
signed rank test, then for matched sets formed by full matching using a 
generalization of the signed rank test. In Section 4, multiple milestones are 
considered, including several milestones for one predictor, single milestones 
for each of several predictors, or several milestones for each of several 
predictors. The theory in Sections 2 and 4 is applied in Sections 3 and 5, 
respectively, to the example in Section 1.6. 

Our use of the term instrumental variable departs slightly from the tradi- 
tional definition, which is stated in terms of covariances; see Cheng and Van 
Ness (1999), Section 4, for review of the traditional definition. The litera- 
ture on correcting for measurement error is extensive; see also Kendall and 
Stuart (1973), Section 29, Fuller (1987), Brenner and Gefeller (1993) and 
Carroll, Ruppert and Stefanski (1995) for several perspectives. The method 
discussed in Rosenbaum (2005) may be viewed as a special case in which 
the error-free milestone occurs between dose zero and all positive doses, in 
which case it was possible to correct for measurement error using controls 
known to have received dose zero of a treatment. The notion of error-free 
milestones is substantially more general, however, in that all doses may be 
affected by errors, and several milestones may be available. 

1.6. Years of education in the Wisconsin Longitudinal Study. Traditional 
questions in sociology and labor economics concern the effects of additional 
schooling or of service in the military. The Wisconsin Longitudinal Study 
(WLS) provides especially detailed information, including an IQ test score 
from high school, and several measures of education. For two of the many 
empirical studies based on the WLS, see Singer et al. (1998) and Warren, 
Sheridan and Hauser (2002). We focus on the 3738 men with wages of at 
least $100 in 1974. The WLS began its data collection with surveys in the 
senior year of high school, which in the U.S. is conventionally recorded as 12 
years of education, with kindergarten and preschool ignored. In WLS, the 
variable edyrcm is self-reported years of education beyond high school, which 
we use in the form SR = edyrcm + 12, where SR signifies "self-report." The 
second measure, edeqyr, is a scaled measure of education based on equivalent 
degrees actually earned (DS for "degree scaled"), for example, 16 years for 
a BA, 20 years for a Ph.D., etc. Using DS, we create a binary indicator of 
whether the individual reports having a BA degree. These two measures of 
education, SR and DS, often differ by a few years, but they are in substantial 
agreement at 16 years of education for the BA. Although not collected in 
precisely the manner suggested in Section 1.4, to a close approximation, SR 
does seem to have the BA degree as an error-free milestone: in SR, all but 
29/3738 = 0.008 < 1% of the men reported less than 16 years of education if 
no BA was received or at least 16 years of education if a BA was received. 

Figure 1 contrasts three measures of education in the WLS, including the 
degree scaled education, DS, and the self reported education, SR. The DS 
and SR differ for 470/3738 = 12.6% of the men, mostly by one year, but 
discrepancies as large as seven years do occur. We assumed that the report of a BA 
or not was accurate, and created a third measure, the adjusted self report or 
SRa, which minimally altered the 29/3738 = 0.008 < 1% of the men whose 
self-reported years were inconsistent with 16 years for the BA. Specifically, 
two men who reported a BA with 15 years of education were credited with 16 
years of education, 24 men who reported no BA with 16 years of education 
and 3 men who reported no BA with 17 years of education were credited 
with 15.999 years of education. This adjustment would not be necessary if 
the questionnaire forced compliance with the milestone. Of course, because 
only small changes were made to 29/3738 records, in Figure 1, SR and SRa 
are indistinguishable, but both differ somewhat from DS. For the purpose 
of illustration in the current paper, we act as if SRa were a fallible measure 
of DS with a milestone at 16 years. 
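
The minimal edit that turns SR into SRa can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the code used for the WLS analysis; the function name `adjust_self_report` and its arguments are hypothetical, and the offset 0.001 is chosen to reproduce the value 15.999 quoted in the text.

```python
def adjust_self_report(years, has_ba, kappa=16.0, eps=0.001):
    """Minimally edit self-reported years so they respect the milestone kappa.

    Hypothetical helper: a reported BA forces at least kappa years, and no
    reported BA forces strictly fewer than kappa years.
    """
    if has_ba and years < kappa:
        return kappa            # e.g., BA with 15 reported years -> 16
    if not has_ba and years >= kappa:
        return kappa - eps      # e.g., no BA with 16 or 17 reported years -> 15.999
    return years                # already consistent with the milestone
```

Reports that already agree with the BA indicator are returned unchanged, so only inconsistent records, such as the 29 described above, would be altered.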

In Sections 3 and 5, we estimate the relationship between log earnings 
in 1974 and education correcting for errors of measurement in self-reported 
education using the BA degree as a milestone for 16 years of education. 

2. Inference using one milestone. 

2.1. Doses measured with random errors. There are $I$ matched sets or 
strata, $i = 1, \dots, I$, matched exactly on covariates $\mathbf{x}$, and matched set $i$ 
contains $n_i \geq 2$ individuals, $j = 1, \dots, n_i$. The $j$th individual in set $i$ has 
covariates $\mathbf{x}_{ij}$, true but unobserved dose $d_{ij}$, fallible observed dose $D_{ij}$, and 
outcome $Y_{ij}$. Here, $\mathbf{x}_{ij}$ and $d_{ij}$ are viewed as fixed, perhaps fixed by 
conditioning as in a regression model, but $D_{ij}$ and $Y_{ij}$ are random variables, in 
part because $D_{ij}$ measures $d_{ij}$ with an error of measurement. Because the 
matching is exact, $\mathbf{x}_{ij} = \mathbf{x}_{ij'}$ for all $i, j, j'$. See Cochran (1968) for some 
discussion of the consequences of close but inexact control for $\mathbf{x}$. 
[Figure 1 appears here. Panel titles: "Years of Education: DS = Degree Scaled" 
and "Difference: DS - SR"; "Years of Education: SR = Self Report" and 
"Difference: DS - SRa"; "Years of Education: SRa = Self Report, Adjusted" and 
"Difference: SR - SRa". Horizontal axes: "Years (edeqyr)" and "Difference in Years".] 

Fig. 1. Three measures of education compared: degree scaled (DS), self-report (SR) and 
self-report minimally edited to be compatible with a BA = 16 years (SRa). There are a 
fair number of small discrepancies between DS and SR, and a very small number of larger 
discrepancies (up to seven years). There are only 29/3738 discrepancies between SR and 
SRa, of which 27 are about one year, and two are two years. 



Let $\mathcal{C}$ be the set of continuous distribution functions on the real line, and 
let $\mathcal{S}$ be the subset of continuous distribution functions on the line that are 
symmetric about zero. The true but unknown dose $d_{ij}$ is assumed to be 
linearly related to $Y_{ij}$, 

(1) $Y_{ij} = \lambda(\mathbf{x}_{ij}) + \beta d_{ij} + \varepsilon_{ij}$, $\varepsilon_{ij} \overset{\text{i.i.d.}}{\sim} G \in \mathcal{C}$, 

where $\beta$ is the parameter to be estimated, and $\lambda(\cdot)$ is an unknown function. 
The fallible, observable dose $D_{ij}$ measures the true but unknown dose $d_{ij}$ 
with errors $\zeta_{ij}$ that are symmetric about zero, are mutually independent, 
and independent of the $\varepsilon_{ij}$, 

(2) $D_{ij} = d_{ij} + \zeta_{ij}$, $\zeta_{ij} \sim F_{d_{ij}} \in \mathcal{S}$, 

so the distribution of measurement errors, $F_{d_{ij}}$, varies with the true dose 
$d_{ij}$, but $D_{ij}$ is always symmetrically distributed about its center or median, 
namely, $d_{ij}$. So far, (1) and (2) slightly generalize the traditional 
errors-in-variables regression model [Wald (1940), Neyman and Scott (1951), 
Madansky (1959), Kendall and Stuart (1973), Section 29, Fuller (1987), Section 1, 
Cheng and Van Ness (1999)], notably because the distribution $F_{d_{ij}}$ of errors 
$\zeta_{ij}$ need not be the same for all true doses $d_{ij}$. To say that $D_{ij}$ measures $d_{ij}$ 
with error, there must be some sense in which the error $D_{ij} - d_{ij} = \zeta_{ij}$ is 
typically zero, and the symmetry of the distribution $F_{d_{ij}}$ of $\zeta_{ij}$ about zero in 
(2) is one such sense. Later, we remove the assumption of symmetry in (2), 
replacing it by the assumption that $E(\zeta_{ij}) = 0$, but for now $\zeta_{ij}$ is symmetric 
about zero. For instance, $\zeta_{ij}$ might have a rescaled and relocated symmetric 
beta distribution with median zero and with a range and a shape that might 
vary in some way with $d_{ij}$. If $D_{ij} - d_{ij}$ is centered, say, at a positive value, 
then the doses are systematically biased, and the methods we propose apply 
to random errors of measurement but not to systematic biases. 

It is well known that $\beta$ is not identified under models (1) and (2). 
Indeed, even if one assumed much more, say, that $\lambda(\mathbf{x}_{ij}) = \alpha$ for all $\mathbf{x}_{ij}$, 
and $\varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)$, $\zeta_{ij} \sim N(0, \sigma_\zeta^2)$, $d_{ij} \sim N(\mu_d, \sigma_d^2)$ with unknown $\sigma_\varepsilon^2 > 0$, 
$\sigma_\zeta^2 > 0$, $\sigma_d^2 > 0$ and $\alpha$, then: (i) there would be no consistent estimate of $\beta$; 
(ii) the likelihood function would have a ridge rather than a unique 
maximum; (iii) least squares regression of $Y_{ij}$ on $D_{ij}$ would be consistent for 
$\beta \sigma_d^2 / (\sigma_d^2 + \sigma_\zeta^2)$, where $D_{ij}$ has reliability $\sigma_d^2 / (\sigma_d^2 + \sigma_\zeta^2) < 1$ as a measure 
of $d_{ij}$; see Cheng and Van Ness (1999), Section 1.2.1. 
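
The attenuation in (iii) can be illustrated by a small simulation. The Python sketch below is an illustration under assumed values only: the function names and all parameter values (slope 0.1, dose standard deviation 2, error standard deviation 1, and so on) are hypothetical and are not taken from the WLS. It draws Normal doses and errors as in the simplified model above, regresses the outcome on the fallible dose, and returns the fitted slope next to the attenuated target $\beta \sigma_d^2 / (\sigma_d^2 + \sigma_\zeta^2)$.

```python
import random


def ols_slope(x, y):
    """Least squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_attenuation(beta=0.1, alpha=2.0, mu_d=14.0, sd_d=2.0,
                         sd_zeta=1.0, sd_eps=0.5, n=50000, seed=0):
    """Return (least squares slope of Y on D, beta times reliability).

    All default values are hypothetical choices for illustration.
    """
    rng = random.Random(seed)
    d = [rng.gauss(mu_d, sd_d) for _ in range(n)]              # true doses
    D = [di + rng.gauss(0.0, sd_zeta) for di in d]             # fallible doses
    Y = [alpha + beta * di + rng.gauss(0.0, sd_eps) for di in d]
    reliability = sd_d ** 2 / (sd_d ** 2 + sd_zeta ** 2)
    return ols_slope(D, Y), beta * reliability
```

With these assumed values the reliability is 4/5, so the least squares slope should settle near 0.08 rather than the true 0.1, which is the bias toward zero that a milestone is meant to remove.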

2.2. Definition of a milestone. The number $\kappa$ is defined to be an 
error-free milestone, or briefly a milestone, for $(D_{ij}, d_{ij})$ if 

(3) $D_{ij} < \kappa \Leftrightarrow d_{ij} < \kappa$, $D_{ij} \geq \kappa \Leftrightarrow d_{ij} \geq \kappa$, $\forall i, j$. 

In the WLS example in Section 1.6, with $\kappa = 16$ years of education, (3) says 
that a respondent might misreport $d_{ij}$ years of education as $D_{ij}$ years of 
education because of a lapse of memory or a miscommunication about the 
investigator's operational definition of what counts as a year of education, 
but an honest, mentally competent respondent could not misunderstand or 
forget the answer to the question: "Did you receive a BA degree?" 

Obviously, a milestone at $\kappa$ in (3) places a restriction on the range of the 
distribution $F_{d_{ij}}$ of the error of measurement $\zeta_{ij}$. If (3) is true with $\kappa = 16$ in 
Section 1.6, then a man who reports $D_{ij} = 18$ years of education has at least 
$d_{ij} \geq \kappa = 16$ years of education, so he exaggerates his education by at most 
two years, $\zeta_{ij} = D_{ij} - d_{ij} \leq 2$. Similarly, a man who reports $D_{ij} = 14$ years of 
education has at most $d_{ij} < \kappa = 16$ years of education, so he understates his 
education by at most two years, $\zeta_{ij} = D_{ij} - d_{ij} > -2$. In general, if $D_{ij} \geq \kappa$, then 
$d_{ij} \geq \kappa$ so that $\zeta_{ij} = D_{ij} - d_{ij} \leq D_{ij} - \kappa$, whereas if $D_{ij} < \kappa$, then $d_{ij} < \kappa$ so 
that $\zeta_{ij} = D_{ij} - d_{ij} > D_{ij} - \kappa$. This range restriction is respected by various 
parametric families of distributions $F_{d_{ij}}$ for $\zeta_{ij}$ in (2) which are symmetric 
about zero, $F_{d_{ij}} \in \mathcal{S}$, including the symmetric beta distributions relocated 
and rescaled to have median zero with support contained in the interval 
$[-|d_{ij} - \kappa|, |d_{ij} - \kappa|]$. 
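
Condition (3) and the range restriction it implies are simple enough to state as code. The Python sketch below is illustrative only; the function names `respects_milestone` and `error_bound` are hypothetical. The first checks whether an observed and a true dose fall on the same side of the milestone, and the second returns the one-sided bound on the error $D_{ij} - d_{ij}$ that follows from the observed dose alone.

```python
def respects_milestone(D, d, kappa):
    """Condition (3) for one individual: D and d lie on the same side of kappa."""
    return (D >= kappa) == (d >= kappa)


def error_bound(D, kappa):
    """One-sided bound on zeta = D - d implied by (3), given only the observed D.

    Returns ("upper", b) when zeta <= b, or ("lower", b) when zeta > b.
    """
    if D >= kappa:
        return ("upper", D - kappa)   # d >= kappa, so zeta <= D - kappa
    return ("lower", D - kappa)       # d < kappa, so zeta > D - kappa
```

For the two men in the text, `error_bound(18, 16)` gives an upper bound of 2 and `error_bound(14, 16)` gives a lower bound of -2.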

2.3. A basic property. Consider testing the hypothesis $H_0 : \beta = \beta_0$ in (1) 
and (2) using the error-free milestone (3). Recall that the matching on $\mathbf{x}$ is 
exact, $\mathbf{x}_{ij} = \mathbf{x}_{ik}$. If a matched set $i$ contains an individual $j$ with $D_{ij} \geq \kappa$ 
and another individual $k$ with $D_{ik} < \kappa$, then compute 

(4) $Q_{ijk}^{(\beta_0)} = (Y_{ij} - \beta_0 D_{ij}) - (Y_{ik} - \beta_0 D_{ik})$ 
    $= \beta (d_{ij} - d_{ik}) - \beta_0 (D_{ij} - D_{ik}) + (\varepsilon_{ij} - \varepsilon_{ik})$ 
    $= (\beta - \beta_0)(d_{ij} - d_{ik}) - \beta_0 (\zeta_{ij} - \zeta_{ik}) + (\varepsilon_{ij} - \varepsilon_{ik})$. 

Because $\kappa$ is a milestone in (3), $d_{ij} - d_{ik} > 0$ in (4). Also, because the $\zeta_{ij}$, $\zeta_{ik}$, 
$\varepsilon_{ij}$, $\varepsilon_{ik}$ are mutually independent with distributions satisfying the conditions 
in (1) and (2), the quantity $-\beta_0 (\zeta_{ij} - \zeta_{ik}) + (\varepsilon_{ij} - \varepsilon_{ik})$ in (4) has a continuous 
distribution symmetric about zero. If $H_0 : \beta = \beta_0$ were true, then $Q_{ijk}^{(\beta_0)}$ in 
(4) would be symmetrically distributed about zero. If $H_0 : \beta = \beta_0$ were false 
with $\beta > \beta_0$, then $Q_{ijk}^{(\beta_0)}$ would be symmetrically distributed about a positive 
quantity, whereas if $\beta < \beta_0$, then $Q_{ijk}^{(\beta_0)}$ would be symmetric about a negative 
quantity. 

The symmetry of $Q_{ijk}^{(\beta_0)}$ about $(\beta - \beta_0)(d_{ij} - d_{ik})$ also holds under certain 
variations of the model in (1) and (2). In particular, the $\varepsilon_{ij}$ need not all have 
the same distribution $G \in \mathcal{C}$; rather, they could have different distributions 
that are symmetric about zero, $\varepsilon_{ij} \sim G_{d_{ij}} \in \mathcal{S}$, and then $Q_{ijk}^{(\beta_0)}$ would still be 
symmetric about $(\beta - \beta_0)(d_{ij} - d_{ik})$. In Section 1.6, for instance, a person 
with $d_{ij} = 18$ years of education might have either a law degree or a masters 
degree in art history, so the $\varepsilon_{ij}$ for wages $Y_{ij}$ might be more variable at 
$d_{ij} = 18$ than at $d_{ij} = 12$, so $G_{18}$ might be more dispersed than $G_{12}$, but 
providing $\varepsilon_{ij} \sim G_{d_{ij}} \in \mathcal{S}$, in (4) the quantity $Q_{ijk}^{(\beta_0)}$ is symmetric about 
$(\beta - \beta_0)(d_{ij} - d_{ik})$. 

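If one replaces all assumptions of symmetry of $\zeta_{ij}$ or $\varepsilon_{ij}$ by the assumption 
that $E(\zeta_{ij}) = E(\varepsilon_{ij}) = 0$, then $Q_{ijk}^{(\beta_0)}$ in (4) has expectation 
$E\{Q_{ijk}^{(\beta_0)}\} = (\beta - \beta_0)(d_{ij} - d_{ik})$, and, in particular, $E\{Q_{ijk}^{(\beta_0)}\} = 0$ if $H_0 : \beta = \beta_0$ is true. 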
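
The algebra in (4) can be checked numerically. In the Python sketch below, which is an illustration with hypothetical function names and made-up values, `q_difference` computes the first line of (4) from observable quantities, and `q_decomposition` computes the last line from the unobservable doses and errors; under models (1) and (2) with exact matching, so that the unknown function of the covariates cancels, the two agree.

```python
def q_difference(Y_j, D_j, Y_k, D_k, beta0):
    """First line of (4): computed from observed outcomes and fallible doses."""
    return (Y_j - beta0 * D_j) - (Y_k - beta0 * D_k)


def q_decomposition(beta, beta0, d_j, d_k, zeta_j, zeta_k, eps_j, eps_k):
    """Last line of (4): written in terms of unobserved doses and errors."""
    return ((beta - beta0) * (d_j - d_k)
            - beta0 * (zeta_j - zeta_k)
            + (eps_j - eps_k))


# Hypothetical values for one matched pair (j above the milestone, k below).
beta, beta0, lam = 0.10, 0.07, 2.0      # lam stands for the common covariate term
d_j, d_k = 17.0, 12.0                   # true doses
zeta_j, zeta_k = 0.5, -1.0              # measurement errors
eps_j, eps_k = 0.3, -0.2                # outcome errors

Y_j = lam + beta * d_j + eps_j          # model (1)
Y_k = lam + beta * d_k + eps_k
D_j = d_j + zeta_j                      # model (2)
D_k = d_k + zeta_k

lhs = q_difference(Y_j, D_j, Y_k, D_k, beta0)
rhs = q_decomposition(beta, beta0, d_j, d_k, zeta_j, zeta_k, eps_j, eps_k)
```

Here `lhs` and `rhs` agree up to floating point rounding, and the covariate term `lam` plays no role in either, which is the point of matching exactly on the covariates.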
2.4. Inference with matched pairs. Suppose that $\kappa$ is a milestone (3) 
for $(D_{ij}, d_{ij})$ and $I$ pairs, $n_i = 2$, $i = 1, \dots, I$, are matched exactly for $\mathbf{x}_{ij}$ 
with the additional requirement that $D_{i1} \geq \kappa > D_{i2}$, or, equivalently, the 
requirement that $d_{i1} \geq \kappa > d_{i2}$. In Section 1.6 this would mean pairing 
someone with at least a BA to someone with less than a BA. Although 
$(D_{i1}, D_{i2}) = (d_{i1} + \zeta_{i1}, d_{i2} + \zeta_{i2})$ is a random quantity because $(\zeta_{i1}, \zeta_{i2})$ is 
random, the event $D_{i1} \geq \kappa > D_{i2}$ is determined by $(d_{i1}, d_{i2})$, which is fixed. 
In Section 1.6 this would mean that, although there are random errors in 
reported years of education $(D_{i1}, D_{i2})$, the pairing of someone with at least 
a BA to someone with less than a BA is made without error: in each pair, 
the person claiming to have a BA has one, and the person claiming not to 
have a BA does not have one. 

To test $H_0 : \beta = \beta_0$ in (1) and (2), calculate the $I$ mutually independent 
differences, $Q_{i12}^{(\beta_0)} = (Y_{i1} - \beta_0 D_{i1}) - (Y_{i2} - \beta_0 D_{i2})$. The $Q_{i12}^{(\beta_0)}$ are 
symmetrically distributed about $(\beta - \beta_0)(d_{i1} - d_{i2})$ by (4), where $d_{i1} - d_{i2} > 0$ because 
the pairing ensured $d_{i1} \geq \kappa > d_{i2}$. Let $T_{\beta_0}$ be Wilcoxon's signed rank statistic 
[e.g., Hettmansperger and McKean (1998), Section 1] computed from the $Q_{i12}^{(\beta_0)}$, 
that is, rank the $|Q_{i12}^{(\beta_0)}|$ from 1 to $I$, and let $T_{\beta_0}$ be the sum of the ranks 
for which $Q_{i12}^{(\beta_0)} > 0$. If $H_0 : \beta = \beta_0$ is true, then $\mathrm{sign}\{Q_{i12}^{(\beta_0)}\}$ and $|Q_{i12}^{(\beta_0)}|$ are 
independent, where $\mathrm{sign}(a) = 1$, $0$, or $-1$ as $a > 0$, $a = 0$, $a < 0$; see Wolfe 
(1974), Corollary 2.1. So if $H_0 : \beta = \beta_0$ is true, then the conditional 
distribution of $T_{\beta_0}$ given the $|Q_{i12}^{(\beta_0)}|$ is the usual exact distribution of Wilcoxon's 
signed rank statistic, namely, the distribution of the sum of $I$ independent 
random variables taking values $i$ or $0$ each with probability $1/2$, $i = 1, \dots, I$. 
Therefore, $T_{\beta_0}$ yields an exact, distribution free test of $H_0 : \beta = \beta_0$. If $\beta > \beta_0$, 
the $Q_{i12}^{(\beta_0)}$ are symmetric about $(\beta - \beta_0)(d_{i1} - d_{i2}) > 0$, so the test based on 
$T_{\beta_0}$ is consistent against $H_1 : \beta > \beta_0$ under mild conditions on the limiting 
behavior of the fixed $d_{ij}$'s and of the $F_{d_{ij}}$ as $I \to \infty$. Similarly, the test is 
consistent against $H_1 : \beta < \beta_0$. A $1 - \alpha$ confidence set for $\beta$ is formed by 
inverting the test, that is, as the set of hypotheses $H_0 : \beta = \beta_0$ not rejected 
by a level $\alpha$ test. Because $D_{i1} \geq \kappa > D_{i2}$, the difference $Q_{i12}^{(\beta_0)}$ is strictly 
decreasing as a function of $\beta_0$, so the signed rank statistic $T_{\beta_0}$ is monotone 
decreasing as a function of $\beta_0$, which implies that this confidence set is an 
interval. Under $H_0 : \beta = \beta_0$, the null expectation of the signed rank statistic 
is $I(I + 1)/4$. The Hodges-Lehmann (1963) point estimate $\hat{\beta}$ of $\beta$ is the 
"solution" to the estimating equation, $T_{\hat{\beta}} = I(I + 1)/4$, in a sense that will 
now be described. Because the rank statistic $T_{\beta_0}$ takes many, small discrete 
steps downward as $\beta_0$ increases continuously, there is either a unique value, 
$\hat{\beta}$, of $\beta_0$ where $T_{\beta_0}$ passes $I(I + 1)/4$, or there is an interval of values of $\beta_0$ 
where $T_{\beta_0} = I(I + 1)/4$, in which case the "solution" $\hat{\beta}$ is defined to be the 
midpoint of this interval. 
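
A minimal Python sketch of the signed rank statistic and the Hodges-Lehmann estimate follows. It is an illustration, not the authors' implementation: the function names and the five example pairs are hypothetical, and ties and zeros among the $|Q_{i12}^{(\beta_0)}|$ are assumed absent, as they are with probability one under continuous errors. The estimate uses a standard representation that is not stated in the text: the signed rank statistic equals the number of pairs $i \leq i'$ with $Q_{i12}^{(\beta_0)} + Q_{i'12}^{(\beta_0)} > 0$, which, because the dose differences are positive, happens exactly when $\beta_0$ is below the ratio of summed outcome differences to summed dose differences. The statistic therefore crosses $I(I+1)/4$ at the median of these ratios.

```python
from statistics import median


def signed_rank_statistic(pairs, beta0):
    """Wilcoxon signed rank statistic for H0: beta = beta0.

    pairs: list of (Y1, D1, Y2, D2) with D1 >= kappa > D2 in every pair.
    Assumes no ties and no zeros among the absolute differences.
    """
    q = [(y1 - beta0 * d1) - (y2 - beta0 * d2) for y1, d1, y2, d2 in pairs]
    order = sorted(range(len(q)), key=lambda i: abs(q[i]))
    return sum(rank for rank, i in enumerate(order, start=1) if q[i] > 0)


def hodges_lehmann(pairs):
    """Point estimate: median of (dY_i + dY_j) / (dD_i + dD_j) over i <= j."""
    dy = [y1 - y2 for y1, d1, y2, d2 in pairs]
    dd = [d1 - d2 for y1, d1, y2, d2 in pairs]   # positive, by the milestone
    n = len(pairs)
    ratios = [(dy[i] + dy[j]) / (dd[i] + dd[j])
              for i in range(n) for j in range(i, n)]
    return median(ratios)


# Hypothetical pairs: (log earnings, years) above and below a milestone at 16.
example_pairs = [
    (10.5, 16, 10.0, 12),
    (10.9, 18, 10.1, 14),
    (10.4, 17, 10.3, 12),
    (11.0, 16, 10.2, 13),
    (10.6, 20, 10.0, 15),
]
estimate = hodges_lehmann(example_pairs)
```

For these made-up pairs, $I = 5$ and $I(I+1)/4 = 7.5$; evaluating `signed_rank_statistic` slightly below and slightly above `estimate` gives values on either side of 7.5, as the description of the "solution" requires. When the number of ratios is even, `median` returns the midpoint of the two middle values, matching the midpoint convention in the text.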
An alternative estimator uses sample means rather than rank statistics. 
Write $\bar{Q}_{12}^{(\beta_0)} = (1/I) \sum_i Q_{i12}^{(\beta_0)} = (\bar{Y}_1 - \beta_0 \bar{D}_1) - (\bar{Y}_2 - \beta_0 \bar{D}_2)$, where 
$\bar{Y}_1 = (1/I) \sum_i Y_{i1}$, etc. Assume in this paragraph only that the i.i.d. $\varepsilon_{ij}$'s have 
finite variance and that the $\zeta_{ij}$'s, which are not i.i.d., have uniformly bounded 
variances. If $H_0 : \beta = \beta_0$ is true in (1) and (2), then $E\{\bar{Q}_{12}^{(\beta_0)}\} = 0$. The 
estimating equation $\bar{Q}_{12}^{(\tilde{\beta})} = 0$ has solution $\tilde{\beta} = (\bar{Y}_1 - \bar{Y}_2)/(\bar{D}_1 - \bar{D}_2)$, which is 
Wald's (1940) estimator, or two-stage least squares, in a context that avoids 
the concerns raised by Neyman and Scott (1951). In $\tilde{\beta}$, the denominator 
has positive expectation, $E(\bar{D}_1 - \bar{D}_2) > 0$ because $D_{i1} \geq \kappa > D_{i2}$, and $\tilde{\beta}$ is 
consistent for $\beta$ under mild conditions on the limiting behavior of the fixed 
$d_{ij}$'s and of the $F_{d_{ij}}$ as $I \to \infty$. In parallel with the procedures above 
using the signed rank test, a one-sample t-statistic may be computed from 
the $Q_{i12}^{(\beta_0)}$. This t-statistic does not have a t-distribution, in part because 
the $\zeta_{ij}$'s in $Q_{i12}^{(\beta_0)}$ are not i.i.d. Normal random variables, and their 
variances may change with $d_{ij}$. With i.i.d. Normal matched pair differences, the 
Pitman asymptotic relative efficiency of the signed rank statistic and the 
t-statistic is $3/\pi = 0.955$, and Sen's (1968) Theorem 2.2 shows that the 
relative efficiency is always greater than or equal to $3/\pi$, often much greater 
than 1, with Normal distributions having unequal variances. In short, in this 
context, the signed rank statistic is robust to outliers, has a known finite 
sample null distribution, and has the possibility of superior efficiency relative 
to the t-statistic. The procedures based on means do have one advantage: 
unlike the signed rank statistic, they yield consistent inferences as $I \to \infty$, 
assuming $E(\zeta_{ij}) = 0$ without the assumption that the $\zeta_{ij}$ are symmetrically 
distributed about zero. 
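
The means-based estimator is a ratio of two differences of sample means and can be sketched directly. The Python function below is illustrative only; the name `wald_estimate` and the tuple layout are hypothetical, and each pair is assumed to list the individual at or above the milestone first.

```python
def wald_estimate(pairs):
    """Ratio of mean outcome difference to mean dose difference.

    pairs: list of (Y1, D1, Y2, D2) with D1 >= kappa > D2 in every pair, so
    the denominator is positive.
    """
    n = len(pairs)
    mean_y_diff = sum(y1 - y2 for y1, d1, y2, d2 in pairs) / n
    mean_d_diff = sum(d1 - d2 for y1, d1, y2, d2 in pairs) / n
    return mean_y_diff / mean_d_diff
```

As a sanity check under assumed values, if outcomes follow $Y = 2 + 0.1 d$ with no errors of either kind, the function returns 0.1 up to rounding. Unlike the signed rank procedure, a single wild outcome or dose can move this ratio substantially, which is the robustness contrast drawn in the paragraph above.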



2.5. Inference with matched sets. In a full matching, each matched set 
with $n_i \geq 2$ individuals contains either 1 individual with $D_{ij} \geq \kappa$ and $n_i - 1$ 
individuals with $D_{ij} < \kappa$ or else $n_i - 1$ individuals with $D_{ij} \geq \kappa$ and 1 
individual with $D_{ij} < \kappa$. Matched pairs, as in Section 2.4, and matching with a fixed 
number of controls are special cases of full matching. It can be shown that 
the stratification or matching that minimizes the total distance on x within 
matched sets is always a full matching, and an optimal full matching — 
one that minimizes the total distance within matched sets — may be con- 
structed by solving a standard combinatorial optimization problem, tech- 
nically known as minimum cost flow in a network [Rosenbaum (1991), Gu 
and Rosenbaum (1993), Hansen (2004, 2007), Hansen and Klopfer (2006)]. 

Such a matched set creates $n_i - 1$ differences $Q_{ijk}^{(\beta_0)}$ of the form (4); however, 
these $n_i - 1$ differences are now dependent because one of the $D_{ij}$'s, say, 
$D_{i1}$, appears in all $n_i - 1$ differences. If $H_0 : \beta = \beta_0$ were true in (1) and 
(2), each $Q_{ijk}^{(\beta_0)}$ in (4) would be symmetric about zero, and the $n_i - 1$ 
differences would have a joint distribution with a form of reflection symmetry 
about zero described by Sen and Puri (1967); specifically, $(Q_{i12}^{(\beta_0)}, \dots, Q_{i1n_i}^{(\beta_0)})$ 
would have the same distribution as $(-Q_{i12}^{(\beta_0)}, \dots, -Q_{i1n_i}^{(\beta_0)})$. If $H_0 : \beta = \beta_0$ 
were true, for any statistic that is a function of the $Q_{ijk}^{(\beta_0)}$, the reflection 
symmetry yields a null permutation distribution formed by changing the 
signs of the $I$ vectors $(Q_{i12}^{(\beta_0)}, \dots, Q_{i1n_i}^{(\beta_0)})$ in all $2^I$ possible ways; see Sen and 
Puri (1967) and Rosenbaum (2005) for details. For instance, in Section 3.2 
the usual Wilcoxon signed rank statistic is compared with this unusual 
permutation distribution which correctly allows for dependence in matched sets 
with $n_i > 2$; see Rosenbaum (2005) for a computational illustration. 
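
The sign-change permutation distribution can be enumerated exactly when the number of matched sets is small. The Python sketch below is an illustration under that assumption; the function names are hypothetical, ties among the absolute differences are assumed absent, and the enumeration over all $2^I$ sign patterns is feasible only for small $I$ (larger problems would need random sampling of sign patterns or an approximation, neither of which is shown). The key point it illustrates is that signs are flipped for a whole matched set at once, not for each difference separately.

```python
from itertools import product


def signed_rank(values):
    """Sum of the ranks of |values| over the positive entries (no ties assumed)."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]))
    return sum(rank for rank, i in enumerate(order, start=1) if values[i] > 0)


def sign_flip_p_value(q_sets):
    """One-sided p-value for large values of the pooled signed rank statistic.

    q_sets: list of I lists; list i holds the n_i - 1 differences of set i.
    Each of the 2^I sign patterns multiplies an entire set by +1 or -1.
    """
    observed = signed_rank([q for q_set in q_sets for q in q_set])
    at_least = 0
    total = 0
    for signs in product((1, -1), repeat=len(q_sets)):
        flipped = [s * q for s, q_set in zip(signs, q_sets) for q in q_set]
        total += 1
        if signed_rank(flipped) >= observed:
            at_least += 1
    return at_least / total
```

For example, with three matched sets whose differences are all positive, only the unflipped pattern attains the observed statistic, so the one-sided p-value is $1/2^3 = 0.125$; treating the six differences as independent would instead suggest $1/2^6$, which overstates the evidence.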

3. An example with a single milestone: education and earnings. 

3.1. Full matching to control for IQ, parent's education and home town. 
The method of Section 2 will be applied to the example in Section 1.6, using 
the BA degree as a milestone for 16 years of education in the self-reported 
years of education, SRa. We contrast the results with least squares and 
Huber's (1981), Section 7, m-estimation using SRa, ignoring measurement 
error and using a linear model for the covariates $\mathbf{x}_{ij}$. In m-estimation, we 
used the defaults for rlm in the MASS package in R. Under the simplest 
models for errors of measurement, we expect the slope estimates from least 
squares and m-estimation to be attenuated, or biased toward zero, and the 
estimate using the milestone to be consistent. The degree scaled measure 
of education, DS, used as the standard for comparison, is analyzed in a 
parallel manner. In practice, the simple measurement error models may be 
incorrect, and the methods differ in several ways, but the comparison serves 
as an illustration. In Section 3.1 the matching is described, while in Section 
3.2 the estimated economic returns to additional education are compared. 

In the Wisconsin Longitudinal Study in Section 1.6, there were 1124 men 
with a BA degree, and 2614 men without one, 3738 = 1124 + 2614. The 1124
men with a BA were matched to 1124 men without a BA. The matching
controlled for a four dimensional x, whose coordinates were IQ in high school
(specifically gwiiq_bm), father's education in years (edfa57q), mother's
education in years (edmo57q) and the population size of the town in which the
individual attended high school (popl5). Parental education was missing in 
whole or in part for 432 men, and an effort was made to match men with 
missing parental education to other men with missing parental education. 

Pair matching is not feasible in these data, because the distributions of
x are quite different for males with a BA and males without a BA. This is
seen for IQ in Figure 2 which depicts the IQ's for the 1124 males with a BA
and the 1124 highest IQ's for males without a BA among the 2614 males
without a BA. Even the 1124 highest IQ's without a BA are too low to form
an acceptable match. Moreover, these 1124 highest IQ's would constitute
a poor match, in part because they ignore the other three covariates, and
in part because some lower IQ's are needed to match to males with BA's
having lower IQ's.

Fig. 2. Limited overlap: all IQ's with a BA and the highest IQ's without a BA.
IQ scores for all 1124 males with a BA and for the 1124 highest IQ scores among
the 2614 males without a BA. The figure shows that pair matching is not feasible, even if
matching on IQ were the only objective.

In place of pair matching, a full matching was performed, with a maximum 
2-to-1 ratio, using all 1124 males with a BA and 1124 males without a BA.
This means that a matched set might be a matched pair or a matched triple.
A pair consists of a male with a BA and a male without a BA, and there
were 239 such pairs. A triple may consist of either a male with a BA and
two without a BA, or two with a BA and one without, and there were 295
triples of each type. That is, there were 829 = 239 + 295 + 295 matched sets,
containing 1124 = 239 + 295 + 2 × 295 males with a BA, and the same number
without a BA. At higher IQ's, two men with a BA might be matched to one 
without a BA, with the reverse pattern at lower IQ's. As noted in Section 
2.5, full matching is the form that minimizes distances within matched sets 
[Rosenbaum (1991)], and an implementation of optimal full matching is 
available in the optmatch package in R [Hansen (2004, 2007), Hansen and 
Klopfer (2006)]. Haviland, Nagin and Rosenbaum (2007), Appendix, present 
a general result about efficiency from matched sets with varied match ratios, 
and the 1-2 limit on imbalance is quite efficient. The distance used was the
Mahalanobis distance on the ranks of the four variables, with two additional
variables containing binary indicators of missing parental education. Because
the Mahalanobis distance is affinely invariant and missing indicators are
included, any value may be substituted for missing values without altering
the Mahalanobis distance. Figure 3 shows the four covariates before and after
full matching. Each covariate is represented by a pair of boxplots, one before
matching, the other after matching. The boxplot before matching compares
the 1124 males with a BA to the 2614 males without a BA by taking all
2,938,136 = 1124 × 2614 differences. The boxplot after matching describes
one number for each of the 829 matched sets, namely, the BA-minus-no-BA
difference in means within a matched pair or triple. After matching, the
differences are close to zero.

Fig. 3. Four covariates before and after full matching. "Unmatched" refers to all pairwise
differences, BA-minus-no-BA. "Matched" refers to the differences in means within 829
matched pairs or triples.
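
The invariance to the value substituted for a missing covariate can be checked numerically. The sketch below is a simplified illustration under stated assumptions: it uses one made-up covariate on its raw scale plus its missing indicator (the analysis above uses ranks of four covariates and two indicators), and all names and numbers are hypothetical.

```python
import math

def covariance_2d(rows):
    """Sample covariance entries (a, b, c) of two-column data, i.e. [[a, b], [b, c]]."""
    n = len(rows)
    mean_x = sum(r[0] for r in rows) / n
    mean_m = sum(r[1] for r in rows) / n
    a = sum((r[0] - mean_x) ** 2 for r in rows) / (n - 1)
    b = sum((r[0] - mean_x) * (r[1] - mean_m) for r in rows) / (n - 1)
    c = sum((r[1] - mean_m) ** 2 for r in rows) / (n - 1)
    return a, b, c

def mahalanobis_sq(rows, i, j):
    """Squared Mahalanobis distance between rows i and j, using the 2 x 2 inverse."""
    a, b, c = covariance_2d(rows)
    det = a * c - b * b
    u = rows[i][0] - rows[j][0]
    v = rows[i][1] - rows[j][1]
    return (c * u * u - 2 * b * u * v + a * v * v) / det

def with_fill(values, fill):
    """Pair each value with a missing indicator, substituting `fill` for None."""
    return [(fill, 1.0) if x is None else (float(x), 0.0) for x in values]

father_ed = [12, 16, None, 10, None, 14, 8, 12]  # made-up years; None = missing
d_fill_0 = mahalanobis_sq(with_fill(father_ed, 0.0), 0, 2)
d_fill_99 = mahalanobis_sq(with_fill(father_ed, 99.0), 0, 2)
```

Changing the fill value adds a multiple of the indicator column to the covariate column, which is an invertible linear map of the two columns, so the two distances agree up to rounding.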
Table 1

Estimates of percent returns to an additional year of education, using degree-scaled (DS)
or self-reported (SRa) schooling. Least squares and m-estimation make no correction for
measurement error. The two milestone methods use the BA as a milestone for 16 years
of education, that is, as an instrumental variable. All methods adjust for four covariates.
The table gives the point estimate, large sample 95% confidence interval and a standard
error (se), which for the milestone estimate is the length of the 95% interval divided by
2 × 1.96.

Method              Sample size  Variable  $\hat{\beta}$  se      95% CI
Least squares       3306         DS        0.035          0.0040  [0.027, 0.043]
Least squares       3306         SRa       0.023          0.0036  [0.016, 0.030]
m-estimation        3306         DS        0.038          0.0027  [0.032, 0.043]
m-estimation        3306         SRa       0.030          0.0025  [0.026, 0.035]
Milestone Wilcoxon  2248         DS        0.044          0.0036  [0.037, 0.051]
Milestone Wilcoxon  2248         SRa       0.041          0.0036  [0.034, 0.048]
Milestone TSLS      3306         DS        0.038          0.0045  [0.029, 0.047]
Milestone TSLS      3306         SRa       0.035          0.0043  [0.027, 0.044]


3.2. Inference about economic returns to education. Table 1 contrasts the 
eight estimates of economic returns to additional years of education, mea- 
sured using log wages in 1974. Of the eight estimates, four are based on the 
better degree scaled education, DS, and four are based on self-report, SRa. 
Two methods, least squares and m-estimation (with R's defaults), make no 
correction for errors of measurement in SRa, whereas the third and fourth 
methods use the BA as a milestone for 16 years. In the third method, as 
described in Section 2.5, the special permutation distribution of Wilcoxon's 
signed rank statistic is used. The fourth method uses two-stage least squares 
with the milestone as the instrumental variable; however, conventional two- 
stage least squares actually requires more than (1) and (2), whereas these 
assumptions suffice for the Wilcoxon method. If DS were free of measure- 
ment error and SRa were prone to measurement error, then least squares 
and m-estimation applied to SRa would be inconsistent, but, assuming that 
a linear model for the covariates $x_{ij}$ holds, the same methods applied to DS
would be consistent. Table 1 asks the following: Which methods give similar 
answers with both DS and SRa? 

Although the methods differ in several respects, not solely the use of the 
milestone, and although sampling variability creates some ambiguity, it does 
appear that (i) use of SRa in least squares or m-estimation yielded a lower 
estimated return to education, and (ii) using the milestone, DS and SRa pro- 
duced similar results. Using the fallible self-report, SRa, the 95% confidence 
interval from m-estimation is [0.026, 0.035], whereas using the milestone with 
the Wilcoxon procedure, it is [0.034,0.048], so these intervals barely overlap. 
Although two-stage least squares used more observations than the Wilcoxon
procedure, its confidence intervals were longer, perhaps because log(income)
does not have a Gaussian distribution [see Imbens and Rosenbaum (2005),
Figure 2(b)], or perhaps because of the remarkable property noted by Sen
(1968) which is directly relevant to (2) when $F_{d_{ij}}$ varies with $d_{ij}$.
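
The Wilcoxon milestone estimate referred to in this section can be sketched for the simplest case of matched pairs, where the dependence within larger matched sets does not arise. The code below is an illustration on simulated data, not the WLS analysis: the simulation design, the uniform error distribution, the bisection bounds and all function names are assumptions. In each pair the first member is above the milestone and the second is below it, so the signed rank statistic of the differences is nonincreasing in $\beta_0$, and the point estimate is the value at which it crosses its null expectation of zero.

```python
import random

def signed_rank(q):
    """Sum of sign(q_i) * rank(|q_i|); expectation zero if the q_i are symmetric about 0."""
    order = sorted(range(len(q)), key=lambda i: abs(q[i]))
    return sum(r if q[i] > 0 else -r for r, i in enumerate(order, start=1))

def milestone_estimate(y_hi, d_hi, y_lo, d_lo, lo=-1.0, hi=1.0, iters=50):
    """Bisection for the beta0 at which the signed rank statistic of
    (y_hi - beta0 * d_hi) - (y_lo - beta0 * d_lo) changes sign."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        q = [(a - mid * b) - (c - mid * e)
             for a, b, c, e in zip(y_hi, d_hi, y_lo, d_lo)]
        if signed_rank(q) > 0:
            lo = mid   # statistic still positive, so beta0 is too small
        else:
            hi = mid
    return (lo + hi) / 2

def simulate_pairs(n_pairs=2000, beta=0.04, milestone=16.0, seed=1):
    """Toy data: symmetric reporting errors that never cross the milestone."""
    rng = random.Random(seed)

    def report(true_dose):
        half_width = min(1.5, abs(true_dose - milestone))
        return true_dose + rng.uniform(-half_width, half_width)

    y_hi, d_hi, y_lo, d_lo = [], [], [], []
    for _ in range(n_pairs):
        shared = rng.gauss(0.0, 1.0)          # covariate effect, common to the pair
        t_hi = rng.uniform(milestone, milestone + 4)
        t_lo = rng.uniform(milestone - 4, milestone)
        y_hi.append(shared + beta * t_hi + rng.gauss(0.0, 0.4))
        y_lo.append(shared + beta * t_lo + rng.gauss(0.0, 0.4))
        d_hi.append(report(t_hi))
        d_lo.append(report(t_lo))
    return y_hi, d_hi, y_lo, d_lo

y_hi, d_hi, y_lo, d_lo = simulate_pairs()
estimate = milestone_estimate(y_hi, d_hi, y_lo, d_lo)
```

A confidence interval could be obtained in the same way by collecting the values of $\beta_0$ that the test does not reject; the sketch stops at the point estimate.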

4. Multiple milestones. 

4.1. Definition and model: partition and reflection symmetry. In this sec- 
tion we extend the model in Section 2 in two ways. First, we allow for mul- 
tiple milestones for one variable, for instance, for years of education, twelve 
years for a high school diploma and sixteen years for a BA degree. In the 
WLS all respondents completed high school with a high school degree, so 
this milestone is not available. Second, we allow for several variables, each 
with at least one milestone. In the example in Section 5, using the WLS 
data, we will estimate the partial slopes for years of education and months 
of military service, using the BA as a milestone for sixteen years of education 
and no military service as a milestone for months of military service. 

In contrast to Section 2, there is now a $P$-dimensional fixed vector $d_{ij} =
(d_{ij1}, \ldots, d_{ijP})$ of true but unobserved doses, and a fallible, random $P$-dimensional
observed dose $D_{ij}$. In Section 5, $d_{ij}$ = (years of education, months of military
service). Write $\mathcal{S}_P$ for the set of $P$-dimensional, continuous multivariate
distributions that are symmetric about 0, in the sense that if $\xi \sim F \in \mathcal{S}_P$,
then $\xi$ and $-\xi$ have the same distribution; see Sen and Puri (1967), Snijders
(1981) or Neuhaus and Zhu (1998). The model is

$$
\begin{aligned}
Y_{ij} &= \lambda(x_{ij}) + \beta^T d_{ij} + \varepsilon_{ij}, & \varepsilon_{ij} &\overset{\mathrm{iid}}{\sim} G \in \mathcal{C}, \\
D_{ij} &= d_{ij} + \zeta_{ij}, & \zeta_{ij} &\sim F_{d_{ij}} \in \mathcal{S}_P,
\end{aligned}
\qquad (5)
$$

where the $\varepsilon_{ij}$ and $\zeta_{ij}$ are mutually independent. The $P$ coordinates of
$\zeta_{ij}$ may be dependent; for instance, exaggerating years of education may be
correlated with exaggerating months of military service. From (5), the
distribution of measurement errors, $F_{d_{ij}}$, varies with the true dose $d_{ij}$, but any
linear combination of the components of the observed dose, $\eta^T D_{ij}$, is always
symmetrically distributed about its center or median, namely, $\eta^T d_{ij}$. Matching
is assumed to exactly control x, so that, as in Section 1.4, two individuals,
$j$ and $k$, in the same matched set, $i$, have $x_{ij} = x_{ik}$.

Write $\mathcal{D}$ for the set of possible values of $d_{ij}$. The generalization of an
error-free milestone is a mutually exclusive and exhaustive partition $\mathcal{D} =
\mathcal{D}_1 \cup \cdots \cup \mathcal{D}_L$, with $\mathcal{D}_\ell \cap \mathcal{D}_{\ell'} = \emptyset$ for $\ell \neq \ell'$, such that
$d_{ij} \in \mathcal{D}_\ell \Leftrightarrow D_{ij} \in \mathcal{D}_\ell$.
Because $d_{ij}$ is fixed, $D_{ij}$ is observed, and $D_{ij} \in \mathcal{D}_\ell$ if and only if $d_{ij} \in \mathcal{D}_\ell$, it
follows that membership in a particular $\mathcal{D}_\ell$ is fixed and known, even though
$d_{ij}$ is not observed. The case of a single milestone had $\mathcal{D} = \mathcal{D}_1 \cup \mathcal{D}_2$
with $\mathcal{D}_1 = \{d : d < k\}$, $\mathcal{D}_2 = \{d : d > k\}$. It is assumed that the partitioning
cuts each of the $P$ coordinates at least once, so that the partition $\mathcal{D} =
\mathcal{D}_1 \cup \cdots \cup \mathcal{D}_L$ includes at least $2^P$ quadrants formed by these $P$ cuts, which
implies $L \geq 2^P$. In the example in Section 5, $d$ = (years of education, months
of military service), and the partition is $\mathcal{D} = \mathcal{D}_1 \cup \cdots \cup \mathcal{D}_4$, where $\mathcal{D}_1$ is
"no BA, no military service," $\mathcal{D}_2$ is "no BA, some military service," $\mathcal{D}_3$ is
"BA, no military service" and $\mathcal{D}_4$ is "BA, some military service." Also, in
asymptotics, as $I \to \infty$, it is assumed that the fraction of observations in $\mathcal{D}_\ell$
tends to a positive constant, $\phi_\ell > 0$, for each $\ell$, where $1 = \phi_1 + \cdots + \phi_L$.

Consider testing the null hypothesis $H_0\colon \beta = \beta_0$ using $Y_{ij} - \beta_0^T D_{ij}$,
comparing matched individuals, $j$ and $k$, in the same matched set $i$, where

$$
\begin{aligned}
V_{ijk}^{(\beta_0)} &= (Y_{ij} - \beta_0^T D_{ij}) - (Y_{ik} - \beta_0^T D_{ik}) \\
&= \beta^T (d_{ij} - d_{ik}) - \beta_0^T (D_{ij} - D_{ik}) + (\varepsilon_{ij} - \varepsilon_{ik}) \qquad (6) \\
&= (\beta - \beta_0)^T (d_{ij} - d_{ik}) - \beta_0^T (\zeta_{ij} - \zeta_{ik}) + (\varepsilon_{ij} - \varepsilon_{ik}),
\end{aligned}
$$

which is symmetric about zero if $H_0\colon \beta = \beta_0$ is true. If $H_0$ is false, then $V_{ijk}^{(\beta_0)}$
is symmetric about $(\beta - \beta_0)^T (d_{ij} - d_{ik})$. Of course, (6) is the multivariate
analogue of (4).

4.2. Optimal nonbipartite matching; vector of signed-rank statistics. We
focus on the case of matched pairs, $n_i = 2$, selected to ensure that $x_{i1}$ and
$x_{i2}$ are as close as possible and that if $d_{i1} \in \mathcal{D}_\ell$ then $d_{i2} \notin \mathcal{D}_\ell$. Define a
distance, such as the Mahalanobis distance, between values of x, and
compute that distance for every possible pair of two individuals; however, if two
individuals have $D$ in the same $\mathcal{D}_\ell$, then replace that distance by $\infty$. With
these distances, apply optimal nonbipartite matching to construct the pairs,
as described in Lu and Rosenbaum (2004), thereby finding a pairing that
minimizes the total distance within pairs on x subject to the constraint that
paired individuals are in different $\mathcal{D}_\ell$. Algorithms for optimal nonbipartite
matching are discussed by Galil (1986), Derigs (1988) and Cook and Rohe
(1999).

Recall that $d_{ij} \in \mathcal{D}_\ell \Leftrightarrow D_{ij} \in \mathcal{D}_\ell$. If $D_{i1} \in \mathcal{D}_\ell$ and $D_{i2} \in \mathcal{D}_{\ell'}$, for $p =
1, \ldots, P$, define $z_{ip} = 1$ if $d \in \mathcal{D}_\ell$, $d' \in \mathcal{D}_{\ell'}$ implies $d_p > d'_p$, $z_{ip} = -1$ if $d \in \mathcal{D}_\ell$,
$d' \in \mathcal{D}_{\ell'}$ implies $d_p < d'_p$, and $z_{ip} = 0$ if $d \in \mathcal{D}_\ell$, $d' \in \mathcal{D}_{\ell'}$ does not itself
determine the ordering of $d_p$ and $d'_p$. For instance, in Section 5, if $D_{i1} \in \mathcal{D}_2$ =
"no BA, some military service," $D_{i2} \in \mathcal{D}_4$ = "BA, some military service,"
then $z_{i1} = -1$ and $z_{i2} = 0$. Write $z_i = (z_{i1}, \ldots, z_{iP})^T$. Notice that the $z_i$ are
determined by the fixed events $d_{ij} \in \mathcal{D}_\ell \Leftrightarrow D_{ij} \in \mathcal{D}_\ell$, so the $z_i$ are fixed.

Consider the hypothesis, $H_0\colon \beta = \beta_0$, and let $r_{i,\beta_0}$ be the rank of $|V_{i12}^{(\beta_0)}|$
in (6), let $s_{i,\beta_0} = \mathrm{sign}(V_{i12}^{(\beta_0)})$, and define the $P$-dimensional vector of signed
rank statistics,

$$T_{\beta_0} = \sum_{i=1}^{I} z_i \, s_{i,\beta_0} \, r_{i,\beta_0}.$$
If $H_0\colon \beta = \beta_0$ were true, then $s_{i,\beta_0} = \mathrm{sign}(V_{i12}^{(\beta_0)})$ and $|V_{i12}^{(\beta_0)}|$ would be
independent; again, see Wolfe (1974), Corollary 2.1. Consider the conditional
distribution of $T_{\beta_0}$ given the $|V_{i12}^{(\beta_0)}|$ under $H_0\colon \beta = \beta_0$; this distribution
has $E(T_{\beta_0}) = 0$ and $P \times P$ covariance matrix

$$\mathrm{var}(T_{\beta_0}) = \sum_{i=1}^{I} r_{i,\beta_0}^2 \, z_i z_i^T. \qquad (7)$$

This variance formula (7) depends upon the continuous distribution of $V_{i12}^{(\beta_0)}$,
which ensures $|V_{i12}^{(\beta_0)}| > 0$ and $|s_{i,\beta_0}| = 1$ with probability one. In the presence
of ties, use average ranks for tied ranks, and use
$\mathrm{var}(T_{\beta_0}) = \sum_{i=1}^{I} |s_{i,\beta_0}| \, r_{i,\beta_0}^2 \, z_i z_i^T$.
If $H_0\colon \beta = \beta_0$ were true, then $T_{\beta_0}^T \{\mathrm{var}(T_{\beta_0})\}^{-1} T_{\beta_0}$ would tend to the
chi-square distribution on $P$ degrees of freedom [Sen and Puri (1967)], and from
this a confidence set for $\beta$ is found by inverting the test. The point estimate
of $\beta$ minimizes $T_{\beta_0}^T \{\mathrm{var}(T_{\beta_0})\}^{-1} T_{\beta_0}$ as a function of $\beta_0$.
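
The construction of the $z_i$, the signed rank vector and the quadratic form can be traced on a toy case with $P = 2$. The sketch below is illustrative only: the coding of a cell as a (has BA, served) pair of 0/1 values, the function names and the three made-up differences are assumptions, and three pairs are far too few for the chi-square approximation to mean anything.

```python
def z_vector(cell_1, cell_2):
    """Cells are (has_ba, served) pairs of 0/1; returns z_ip in {+1, -1, 0} per coordinate."""
    return tuple((a > b) - (a < b) for a, b in zip(cell_1, cell_2))

def signed_rank_vector(v, z):
    """T = sum_i z_i s_i r_i and var(T) = sum_i r_i^2 z_i z_i^T, assuming no ties or zeros."""
    order = sorted(range(len(v)), key=lambda i: abs(v[i]))
    rank = [0] * len(v)
    for r, i in enumerate(order, start=1):
        rank[i] = r
    p_dim = len(z[0])
    t = [0] * p_dim
    cov = [[0] * p_dim for _ in range(p_dim)]
    for v_i, r_i, z_i in zip(v, rank, z):
        s_i = 1 if v_i > 0 else -1
        for p in range(p_dim):
            t[p] += s_i * r_i * z_i[p]
            for q in range(p_dim):
                cov[p][q] += r_i * r_i * z_i[p] * z_i[q]
    return t, cov

def quadratic_form_2d(t, cov):
    """T^T var(T)^{-1} T for P = 2, using the closed-form 2 x 2 inverse."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    return (d * t[0] ** 2 - (b + c) * t[0] * t[1] + a * t[1] ** 2) / det

# Three toy pairs: (cell of member 1, cell of member 2).
cells = [((1, 0), (0, 0)), ((0, 1), (1, 0)), ((1, 0), (0, 1))]
z = [z_vector(c1, c2) for c1, c2 in cells]
v = [0.5, -1.5, 1.0]          # made-up values of the pair differences at some beta_0
t, cov = signed_rank_vector(v, z)
stat = quadratic_form_2d(t, cov)
```

For these numbers the ranks of $|V|$ are 1, 3, 2, so $T = (6, -5)^T$, the covariance matrix has rows $(14, -13)$ and $(-13, 13)$, and the quadratic form equals $38/13$. The example in the text, "no BA, some military service" paired with "BA, some military service," corresponds to `z_vector((0, 1), (1, 1))`, which gives $(-1, 0)$.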

5. An example with multiple milestones: returns to education and military
service. In the WLS data of Section 1.6, with $d_{ij}$ = (years of education
in DS, true months of military service), with the BA degree as a milestone
for 16 years of education and with no military service as a milestone for
zero years of service, the partial regression coefficients $\beta = (\beta_{ed}, \beta_{ms})^T$ will
be estimated from the fallible self report $D_{ij}$ = (years of education in SRa,
measured months of military service). As in Section 3, men were matched for 
a 4-dimensional x consisting of IQ in high school, father's education in years, 
mother's education in years and the population size of the town in which 
the individuals attended high school. From the 3738 men, we formed 1000 
pairs of two men by optimal nonbipartite matching, as described in Section 
4.2, where the distance was the Mahalanobis distance computed from the 
ranks of the four variables and from two indicators for missing parental ed- 
ucation. The matching resulted in 230 pairs whose members differ on which 
side of both the BA and military service milestones they are on, 199 pairs 
whose members differ only on which side of the BA milestone they are on 
and 571 pairs whose members differ only on which side of the military ser- 
vice milestone they are on. Among the 230 pairs whose members differed 
on both BA and military service, in 143 pairs, one member had a BA and 
no military service and the other member had no BA and military service, 
and in 87 pairs, one member had a BA and military service and the other 
member had neither a BA nor military service. Boxplots similar to Figure
1 but not included show that the differences on x are nearly zero within 
matched pairs. 
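
The constrained pairing can be illustrated on six made-up individuals. The sketch below is not one of the algorithms cited in Section 4.2: it finds the optimal pairing by exhaustive search, which is only workable for a handful of units, it pairs every unit rather than a subset, and it uses the absolute difference in a single covariate in place of the rank-based Mahalanobis distance. All names and numbers are hypothetical.

```python
import math

def masked_distances(iq, cell):
    """Absolute IQ difference, replaced by infinity when two units share a partition cell."""
    n = len(iq)
    return [[math.inf if i == j or cell[i] == cell[j] else float(abs(iq[i] - iq[j]))
             for j in range(n)] for i in range(n)]

def best_pairing(dist, units=None):
    """Exhaustive search over perfect pairings; returns (total distance, list of pairs)."""
    if units is None:
        units = list(range(len(dist)))
    if not units:
        return 0.0, []
    first, rest = units[0], units[1:]
    best_cost, best_pairs = math.inf, []
    for k, partner in enumerate(rest):
        if math.isinf(dist[first][partner]):
            continue  # forbidden: both units are on the same side of every milestone
        sub_cost, sub_pairs = best_pairing(dist, rest[:k] + rest[k + 1:])
        total = dist[first][partner] + sub_cost
        if total < best_cost:
            best_cost, best_pairs = total, [(first, partner)] + sub_pairs
    return best_cost, best_pairs

iq = [100, 101, 110, 112, 120, 125]   # made-up covariate values
cell = [1, 1, 2, 2, 3, 4]             # made-up partition cells
cost, pairs = best_pairing(masked_distances(iq, cell))
```

Without the constraint the closest pairing would join the two units in cell 1 and the two in cell 2 at total distance 8; with same-cell pairs forbidden the best total distance rises to 26, and every pair then differs on at least one milestone.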

We now compare the least squares, m-estimate and milestone estimates of 
percent returns to an additional year of education and an additional month 
of military service. The least squares estimates and m-estimates regress log 
wages on self reported education (SRa), months of military service, IQ, fa- 
ther's education, mother's education and scaled hometown population and 
lose 445 men due to missing data on parent's education or months of mil- 
itary service, leaving 3293 men. The milestone estimates are based on the 
nonbipartite matching described above of 1000 pairs of two men, matching 
missing data to missing data, using the methods in Section 4.2. 

Figure 4 plots the three 95% confidence sets for $\beta = (\beta_{ed}, \beta_{ms})^T$. As in Section
3, the milestone method suggests the returns to education, $\beta_{ed}$, are higher
than the two regression methods that ignore measurement error. Specifically,
for least squares $\beta_{ed}$ is about a 2% increase in earnings per year of education,
for m-estimation $\beta_{ed}$ is about 3% per year, and for the milestone method $\beta_{ed}$
is about 4% per year. For military service, the milestone method suggests
$\beta_{ms}$ might be zero, whereas the regression methods suggest reduced earnings.
Table 2 presents numerical results. The confidence intervals for the milestone
method are projections of the confidence set, so their simultaneous coverage
is 95%.

Fig. 4. 95% confidence sets for $(\beta_{ed}, \beta_{ms})^T$ by three methods.

Table 2

Estimates of the percent returns to an additional year of education (ed) or an additional
month of military service (ms). The two stage least squares estimates use receipt of a BA
and whether the man served in the military as instrumental variables.

Education
Method         Sample size  $\hat{\beta}_{ed}$  se       95% CI
Least squares  3293         0.02214             0.00363  [0.01502, 0.02927]
m-estimation   3293         0.02967             0.00246  [0.02485, 0.03449]
Milestone      2000         0.03820             0.00679  [0.02495, 0.05155]
TSLS           3293         0.03587             0.00436  [0.02733, 0.04441]

Military
Method         Sample size  $\hat{\beta}_{ms}$  se       95% CI
Least squares  3293         -0.00056            0.00025  [-0.00104, -0.00007]
m-estimation   3293         -0.00042            0.00017  [-0.00075, -0.00009]
Milestone      2000         0.00009             0.00024  [-0.00039, 0.00057]
TSLS           3293         0.00032             0.00046  [-0.00058, 0.00121]

6. Summary. For use with measurement error, there are several meth- 
ods for using strong, valid instrumental variables [e.g., Cheng and Van Ness 
(1999), Section 4.2], but few methods for constructing them. Error-free mile- 
stones in error prone measurements create instrumental variables. In the 
Alzheimer's disease example in Section 1.2, the dose of anesthesia is mea- 
sured with error, except for the zero doses of patients who did not have 
surgery. In the combat exposure scale example in Section 1.3, certain points 
on the scale are anchored by military records, while others depend on self 
report and memory, the latter being far more prone to error. In surveys in 
Section 1.4, a scaled response with aspects prone to error because of sub- 
tle operational definitions or memory lapses may sometimes be anchored by 
events that are difficult to misunderstand or forget, such as events marked by 
public ceremony or official sanction. Although various generalizations were 
mentioned, the discussion has focused on a predictor that has errors which 
are symmetric about zero yet respect a milestone, and in this case, exact, 
nonparametric inference was developed. 